IDENTIFIABILITY OF LEVEL-1 SPECIES NETWORKS FROM GENE TREE QUARTETS

When hybridization or other forms of lateral gene transfer have occurred, evolutionary relationships of species are better represented by phylogenetic networks than by trees. While inference of such networks remains challenging, several recently proposed methods are based on quartet concordance factors — the probabilities that a tree relating a gene sampled from the species displays the possible 4-taxon relationships. Building on earlier results, we investigate what level-1 network features are identifiable from concordance factors under the network multispecies coalescent model. We obtain results on both topological features of the network, and numerical parameters, uncovering a number of failures of identifiability related to 3-cycles in the network.

1. Introduction.Statistical inference of phylogenetic networks, showing evolutionary relationships between species when hybridization or other horizontal gene transfer has occurred, poses substantial theoretical and practical problems.With data in the form of many sequenced and aligned genes, standard phylogenetic methods can be used to infer gene trees.However, due to both horizontal inheritance and the population genetic effect of incomplete lineage sorting, these gene trees reflect the species network topology only indirectly.Extracting the network signal with an acceptable computational time, and even determining what aspects of the network can be inferred under the Network Multispecies Coalescent (NMSC) model, is challenging.
Several recently-developed network inference methods utilize summaries of (inferred) gene trees through counts of their displayed quartet trees, that is, empirical quartet concordance factors (CF s).SNaQ [16] using pseudolikelihood on these CF s to pick an optimal network among those of level 1. NANUQ [2] also uses quartet counts in the level-1 setting, but avoids even pseudolikelihood computations, by conducting hypothesis tests for each quartet, followed by a distance-based approach to avoid searching over networks.(PhyloNet [20] similarly uses pseudolikelihood, though with rooted triple counts and without the level-1 restriction.) While these methods strike a balance between thorough statistical analysis and computational effort, a complete exploration of what level-1 network features are identifiable from CFs under the NMSC has yet to be undertaken.First results in this direction [16] showed certain semidirected level-1 network topologies were distinguishable from those obtained by dropping a hybrid edge, and that in some cases numerical parameters were identifiable up to a finite number of possibilities, i.e., were locally identifiable.Topological identifiability was later investigated [6], establishing that semidirected level-1 network topologies are identifiable up to contraction of 2and 3-cycles and directions of hybrid edges in 4-cycles, for generic parameters.While these works provide our starting point, we seek to fill in unaddressed gaps.(Appendix A of the supplementary materials gives more detail on how this work complements its predecessors, and discusses the claims and arguments in [17] for work described in [16].) We rigorously establish what can be identified, and what cannot, from quartet CF s under the NMSC.Our concern here is only the theoretical question of identifiability.We thus delineate what might be consistently inferred by a method using quartet counts, although particular methods may not be able to do so.Our results also imply parameter identifiability results for data types from which quartet counts can be obtained (e.g., topological gene trees, or metric gene trees), although for such data it is possible that stronger identifiability claims could be established.
Our main results address identifiability of the full semidirected topology of a binary network, including hybrid edge directions (Theorem 4. 19), and the numerical parameters of edge lengths and hybridization (or inheritance) probabilities (Theorem 5.12).One interesting aspect is that the presence of a 3-cycle can generally be detected, but whether the hybrid node of that cycle can be identified or not depends on the numerical parameters.The subsets of parameters on which this question has a positive or negative answer both have positive measure, and thus neither set can be dismissed as non-generic.
The precise statements of these theorems have exceptions for cycles adjacent to pendant edges.However, these simplify if one has multiple samples per taxon.Then the semidirected network topology is generically identifiable except for the presence of 2-cycles and (sometimes) hybrid nodes in 3-cycles.If the semidirected network topology with 2-cycles removed is known, then all numerical parameters except those relating to 3-cycles and their adjacent edges are generically identifiable.
Underlying our results are analyses of algebraic varieties associated with certain small networks using computational algebra software.These lead to algebraic (polynomial equality) tests of quartet CF s for different network substructures.However, 3-cycle identifiability results depend on semialgebraic tests (polynomial inequalities).These tests were motivated by equalities found for related networks, but their construction is not purely computational.
Our identifiability results for numerical parameters are based on explicit rational formulas for parameters in terms of CF s, so if a topological network is known or proposed, one could in principle estimate numerical parameters with them.But while some of these formulas are simple, others are quite complicated, and should not be expected to provide good estimates from data.These formulas may, however, give useful initial estimates of parameters that could then be refined through optimization, such as with likelihood methods.
Sections 2 and 3 give definitions and earlier results that we use as our starting point.In Section 4 we study topological identifiability of level-1 binary networks from quartet CF s, and in Section 5 the identifiability of numerical network parameters from the same information.Section 6 discusses implications for data analysis.
In the Supplementary Materials, Appendix A explains how our results complement earlier work, and Appendix B catalogs the computational results for specific networks that underly our arguments.2.1.Rooted and unrooted phylogenetic networks.A topological binary rooted phylogenetic network N + is a finite rooted graph, with all edges directed away from the root, whose non-root internal nodes form two classes: Tree nodes have indegree 1 and outdegree 2, while hybrid nodes have indegree 2 and outdegree 1. Hybrid edges and tree edges are classified according to their child nodes.Leaves of the network are bijectively labelled by taxa in a set X. A network is metric if in addition each tree edge is assigned a positive length, each hybrid edge a non-negative length and a positive probability γ, such that for every pair of hybrid edges e, e ′ with a common child γ + γ ′ = 1.More formal definitions of phylogenetic networks appear in [18,16,6].

Definitions.
We often depict these networks with their root at the top, referring to edges and nodes as above or below one another in the natural way.
As explained more fully in [2,6], for gene quartet-based methods of inference a useful form of an unrooted network is more subtle than that for a tree, since substructures above the least stable ancestor (LSA) of the taxa [18] are undetectable by them.The topological unrooted phylogenetic network induced from N + , is the semidirected network N = N − , obtained by N + by deleting all vertices and nodes above the LSA, undirecting tree edges, and suppressing the LSA.Since our concern in this work is the identifiability of unrooted phylogenetic networks, we will often use N rather than the more cumbersome N − to denote them.We refer to N as unrooted or semidirected interchangeably.Note that N naturally inherits a metric structure if N + has one.
Figure 2.1 shows an example of a network N + and its semidirected network N − .While in that example all leaves are equidistant from the root of N + , we do not assume that generally.
Tree edges can be further partitioned into cut and noncut edges, according to whether their deletion results in a graph with 2 connected components or not.Note that hybrid edges are never cut edges.
Of particular interest are unrooted networks on four taxa obtained from a larger network by restricting the taxon set.Recall that a binary unrooted topological tree on four taxa a, b, c, d is called a quartet ab|cd if deletion of its sole internal edge gives connected components with taxa {a, b} and {c, d}.When n ≥ 4, an n-taxon tree displays a quartet ab|cd if the induced unrooted tree on the four taxa is ab|cd.A formal extension of this concept to quartet networks follows.
Definition 2.1.Let N + be a rooted network on X, and let a, b, c, d ∈ X.The induced quartet network on a, b, c, d is the unrooted network obtained by 1. retaining only the nodes and edges of N + ancestral to at least one of a, b, c, d, 2. suppressing nodes of degree 2, and 3. unrooting the resulting network.If N + is a metric network, its quartet networks naturally are as well.
An analogous definition for induced quartet networks of N is given in [6], which also shows the quartet networks induced from N + and N are isomorphic.Figure 2.2 shows some metric quartet networks induced from the networks of Figure 2.1.

2.2.
Level-1 networks.We restrict our study to the family of level-1 phylogenetic networks.These have been the focus of many works [12,13,11,15,16,6,2], though only a few of these incorporate the coalescent model that is central here.
By a cycle in either a rooted or unrooted phylogenetic network we mean a set of edges and nodes that form a cycle when all edges are treated as undirected.
Definition 2.2.Let N be a (rooted or unrooted) binary topological network.If no two cycles in the undirected graph of N share a node, then N is level-1.
In some works level-1 networks are defined as those in which no cycles share an edge; i.e., cycles are edge-disjoint rather than the stricter vertex-disjoint condition we adopt.However, in our context of binary networks they are equivalent [15].
In a level-1 network a cycle that is composed of m edges, (2 hybrid edges and m − 2 tree edges) is said to be an m-cycle.More specifically, it is an m k -cycle if there are exactly k taxa descended from its unique hybrid node [6].This terminology can be used for semidirected networks, since 'descended from a hybrid node' is unambiguous, regardless of where the network is rooted.
Let N be an unrooted level-1 network on X with an m-cycle C. Then C induces a partition of X into m subsets according to the connected components obtained by deleting all edges in the cycle.Elements of this partition are the blocks of C. The hybrid block of C is the block of taxa descended from the hybrid node in C. If the blocks of C have n 1 , n 2 , . . ., n m taxa, then we say C induces a (n 1 , n 2 , . . ., n m ) partition.

The Network Multispecies Coalescent
Model and quartet concordance factors.The Network Multispecies Coalescent (NMSC) model [14] mechanistically describes the formation of gene trees within a species network, as gene lineages are traced backward in time to common ancestors in the edge populations of the network.Under it, gene trees may differ in topology from any displayed trees on the species network.Given a metric rooted phylogenetic network, the NMSC assigns positive probabilities to all resolved metric gene trees, and, through marginalization, to topological gene trees and induced gene quartet topologies.Definition 3.1.Let N + be a metric rooted network on a taxon set X, and A, B, C, D a gene sampled from individuals in species a, b, c, d ∈ X respectively.The (scalar) quartet concordance factor CF ab|cd = CF ab|cd (N + ) is the probability under the NMSC on N + that a gene tree displays the quartet AB|CD.The (vector) quartet concordance factor CF abcd = CF abcd (N + ) is the triple of concordance factors of each possible quartet on the taxa a, b, c, d.
That CF s for quartet networks depend only on the semidirected quartet network, was proved in [6].That result implies the following.Lemma 3.2.Under the NMSC on a level-1 network N + the values of the quartet CF s depend only on the induced semidirected network N .
Following on the first steps investigating level-1 network identifiability from quartet CF s taken in [16], the next result, that most topological features of a level-1 species network are identifiable from quartet CF s, appeared in [6].
Theorem 3.3.[6] Let N be a binary semidirected metric level-1 species network.Let N ′ be the semidirected topological network obtained from N by contracting all 2and 3-cycles, and undirecting hybrid edges in 4-cycles.Under the NMSC model with generic numerical parameters, the network N ′ is identifiable from quartet CF s for N .
We take this theorem as our starting point, and in Section 4 focus on the remaining questions of topological identifiability: From quartet CF s can any aspects of 2-cycles or 3-cycles can be identified, and for 4-cycles can the hybrid node be identified?In Section 5 we turn to identifiability of the numerical parameters of edge lengths and hybridization probabilities.While these were not a focus in [6], partial results on local identifiabiity of numerical parameters were given in [16].
Unless explicitly stated otherwise, we assume that exactly 1 gene lineage is sampled per taxon.If 2 lineages were sampled for a taxon, say a, 'pseudotaxa,' a 1 and a 2 can be introduced by attaching a cherry leading to these at the leaf a of the network.Under the NMSC, CF s for the modified network with 1 sample from each a i are identical to those for the original network with 2 samples from a. Sampling more than 2 lineages per taxon only introduces new CF s in which 3 or 4 pseudotaxa from the same taxon appear, but due to exchangeability of lineages under the NMSC these CF s are always 1/3.Thus identifiability results for any multiple sampling scheme will follow from the single sample case on a modified network.No edge lengths are needed in the pseudotaxa cherries, since no coalescent event may occur on them.
Under the NMSC one can derive formulas for CF s for any fixed network in terms of the numerical parameters.These have the form of polynomials in the hybridization parameters γ and the exp(−t) for all edge lengths t.The expression exp(−t) has a simple interpretation as the probability that two gene lineages entering an edge of length t coalescent units (tracing time backwards) do not coalesce within that edge.By reparameterizing using edge probabilities ℓ = exp(−t) ∈ (0, 1] rather than lengths t ∈ [0, ∞), all formulas for CF s are given by polynomial formulas in the ℓs and γs.
The 3 n 4 scalar quartet CF s for a fixed topological network N on n taxa then define a polynomial map from the numerical parameter space into R 3( n 4 ) .Extending the map to allow complex ℓ, γ, gives a parameterized algebraic variety.The set of multivariate polynomials in the CF s that vanish on the parameterization's image is an ideal, denoted I(N ) = I(N + ) = I(N − ).The zero set V(N ) of the polynomials in I(N ) is the Zariski closure of the parameterized variety.These notions from applied algebraic geometry provide a framework for our work.Elements of I(N ) are called invariants, and depend only the network topology, and not its numerical parameters.
Our arguments use symbolic computations with CFs from specific networks, performed and verified by the software Singular [7] and Macaulay2 [9].Despite their essential role, for brevity all computational results are stated in the supplementary materials, Appendix B. That section also contains an exposition of certain linear invariants that can be derived without computation, and which help simplify both computations and statements of results.

Identifiability of semidirected network topologies.
4.1.2-cycles.We first show 2-cycles (parallel edges) in level-1 networks are never identifiable.By replacing a 2-cycle with parental node u and child node v by an edge, we mean removing its two edges, introducing a new directed edge (u, v) with a specified edge probability, and suppressing resulting nodes of degree 2.
The content of the following Lemma was essentially given in [16], and has appeared in other works subsequently, most recently [5].We restate it here for completeness.Lemma 4.1.Let N + be a level-1 rooted binary metric phylogenetic network, with a 2-cycle composed of hybrid edges with edge probabilities h 1 , h 2 , and corresponding hybridization parameters γ 1 , γ 2 = 1−γ 1 .Then quartet CF s for N + under the NMSC are unchanged if the 2-cycle is replaced by an edge with edge probability ℓ ∈ (0, 1) determined by the equation . Since varying the 2-cycle parameters in the above expression causes ℓ to range over the full interval (0, 1), we obtain the following.Corollary 4.2.Using quartet CF s, under the NMSC a topological level-1 phylogenetic network N with a 2-cycle cannot be distinguished from the network N obtained by replacing that two cycle with an edge.4.2.3-cycles.Using Theorem 3.3, the question of identifying topological 3cycles in a network is reduced to distinguishing between the network that theorem identifies, and networks obtained from it by replacing some set of non-cycle tree nodes with 3-cycles.We only consider networks with 5 or more taxa, as the 4-taxon case is fully studied in [6].

3-cycles near leaves.
We begin with a non-identifiability result, for certain 3-cycles adjacent to two pendant edges of a network, as shown in Figure 4.1.
Proposition 4.3.Suppose a binary semidirected network N on n ≥ 4 taxa has a 3-cycle C inducing a (1, 1, n − 2) partition of the taxa.Let N ′ be the network obtained by contracting C to a node.Then under the NMSC 1.If C is a 3 1 -cycle, so its hybrid node has only 1 descendant taxon, the topologies of N and N ′ cannot be distinguished using quartet CF s.That is, for any choice of parameters on one of these networks, there exist parameters on the other giving identical CF s.Moreover, the parameters other than those associated to C and internal edges adjacent to C may be chosen to be identical on both networks.2. If C is a 3 k -cycle with k = n−2 ≥ 2, and the parameter spaces are extended to allow all real edge lengths in the CF formulas, then for any choice of extended parameters on N there are extended parameters on N ′ giving identical CF s, and vice versa.Moreover, the parameters other than those associated to C and internal edges adjacent to C may be chosen to be identical on N and N ′ .
Furthermore, for strictly positive edge lengths on N and N ′ , there are two positive-measure subsets of parameters, Θ 1 , Θ 2 , for N , such that on Θ 1 the topologies of N and N ′ are not distinguishable using quartet CF s, and on Θ 2 are distinguishable.
Note that case (1) implies that if N is as shown in Figure 4.1(L) then N and N ′ also cannot be distinguished from the network obtained from N by interchanging the a and b labels.In case (2), if parameters are such that N is not distinguishable from N ′ , then case (1) implies that they are also not distinguishable from the two networks obtained by redesignating the hybrid node in the 3-cycle to be a singleton.When N is distinguishable from N ′ , then by case (1) it is distinguishable from those two other networks as well.
Proof.Let a, b denote the taxa in the singleton blocks.For case (1), we may assume the network is rooted, with the root outside C and not on the pendant edges leading to a, b (Figure 4.1 (L)).Then under the NMSC there is a probability p ∈ (0, 1), depending on the numerical parameters of the 3-cycle, that lineages a and b fail to coalesce before leaving the 3-cycle.Replacing the 3-cycle and its adjacent edges by a 3-leaf tree where the edge leading toward the n − 2 taxa has edge probability p leaves the distribution of topological gene trees, and hence quartet CF s, unchanged.Varying parameters over the 3-cycle or over the 3-leaf tree allows all probabilities p ∈ (0, 1) to be achieved.
In case (2), let v denote the hybrid node in the 3-cycle, so a, b are not descendants of v for any rooting.(Figure 4.1 (R)).The value of any CF involving at most one of a, b is determined by the network and numerical parameters below v, since as a gene tree forms either a coalescent event occurs below v, or 3 or 4 lineages reach v, so that all three gene quartet topologies have probability 1/3.Thus the 3-cycle only affects values of CF s involving both a and b, and only through events in which no coalescence has occurred below v.We may thus replace the cycle and its adjacent edges to a, b with any graphical structure and parameters that produce the same probabilities of gene quartet topologies when exactly two lineages enter at v.These conditional probabilities are the CFs of the quartet network shown in Figure 4.2: Note that we have dropped the subscripts 1, 2 from the c taxa, since by exchangeability of those lineages under the NMSC, they may be assigned arbitrarily.Now a quartet tree with topology ab|cc and internal edge probability z yields so, using (4.1), without changing the CF s the 3-cycle and edges to a, b in N could be replaced by a 3-leaf tree with an edge leading to an ab cherry having edge probability provided 0 < z < 1.Since this inequality holds on a set of positive measure in parameter space, on that set the topologies N and N ′ are not distinguishable.However, z > 1 also occurs on a set of positive measure.Suppose in this case that the edge e = (v, w) below v has as its child w a node outside of a cycle, and let c 1 , c 2 be taxa chosen from distinct taxon blocks below that node.Then if parameters on N are in the set determined by z > 1 and the edge probability p for e satisfies pz > 1, then for N CF ac|bc = pz/3 > 1/3.
Since for a quartet tree CF ac|bc < 1/3, N is distinguishable from N ′ on this set.If w is instead in a cycle, a similar argument applies.
This proof essentially follows arguments given in [6] for quartet networks with a 3 1 -cycle and 3 2 -cycle.In case (2) the parameters for which the 3-cycle is topologically identifiable are ones that make the quartet network anomalous, in the sense of [5]   f abc = 3CF ab|ac CF ab|bc − CF ab|ab is in I(T 5 ), but not in I(N 5−31 ) nor I(N 5−32 ).Using expressions for CFs in terms of parameters from Proposition B.2, f abc can be interpreted as expressing the total internal path length in the tree T 5 is the sum of the lengths of the two internal edges.This polynomial, and variants of it, will play an important role in identifying 3-cycles.The first result in this direction is the following.∈ I(N ) for the non-tree N , it does not vanish for all parameters on them, and is zero only for a set of measure zero in their parameter space.Thus generically the vanishing of f abc distinguishes T 5 from the others.differ from N a in which taxa are below the hybrid node.Thus the hybrid node of the 3-cycle in these three networks cannot be determined from purely algebraic conditions on CF s.While the vanishing of any of the three f abc , f bca , f cab (and hence all) distinguishes the tree T 6 from N a , N b , and N c , that was already implicit in Theorem 4.4.

3-cycles on
small networks: Semialgebraic conditions.While subsection 4.2.2 has shown the presence of a 3-cycle can be detected in some networks, that result pertains only to the undirected cycle.To obtain information on the hybrid node, we use a semialgebraic approach, focusing on polynomial inequalities.
if, and only if, Moreover, f abc is identical on the networks N 5−32 and N ′ 5−32 for the same parameter values, so f abc gives no information to distinguish between these.
Finally, there are positive measure subsets of the numerical parameter space for N 5−32 on which f abc < 0 and on which f abc > 0.
Proof.Theorem 4.4 states that f abc = 0 for generic parameters if, and only if, N = T 5 .If N = N 5−31 then using the formulas for CF s in Proposition B.3 gives, for γ, x, ℓ 1 , ℓ 2 ∈ (0, 1), Since f abc is invariant under interchanging the as and bs, its values for N 5−32 and N ′ 5−32 are the same.Specific examples of parameters on N 5−32 show both f abc < 0 and f abc > 0 can occur, and by continuity there are positive measure subsets of parameter space on which these occur.
If a 5-taxon network does have a 3-cycle C, then this proposition may provide some information on the hybrid node's location.For instance, f abc < 0 implies the taxon c which is not in a cherry on the tree obtained by contracting C to a vertex is also not a hybrid descendant of the 3-cycle.However, for other numerical parameters f abc > 0, in which case there is no information on the hybrid location.
To further develop semialgebraic tests for 3-cycle hybrid nodes, we again consider the 6-taxon networks N a , N b , N c described in Figure 4.4.Define the following functions of the CF s, building on the f xyz : Proposition 4.6.Under the NMSC, for CF s arising from the tree T 6 , G xyz = 0 for all x, y, z, while G xyz > 0 for CF s arising from the network N x .
If a network is known to have one of the topologies N a , N b , N c , then at least one of these topologies can be ruled out by the signs of G abc , G cab , G bca : If G xyz < 0 then the network is not N x .Finally, there are positive measure subsets of the numerical parameter space for N y and N z on which G xyz < 0 and on which G xyz > 0.
Proof.That G xyz = 0 for T 6 restates that G xyz ∈ I(T 6 ).Using formulas from Proposition B.6, for CF s from N a , Since G abc + G cab + G bca = 0 and one of these terms is positive for each of N a , N b , N c , at least one is negative.
One can find specific parameters on N x for which G xyz < 0 and G xyz > 0, and by continuity these conditions hold on sets of positive measure.Figure 4.5 illustrates the proposition, showing (G abc , G bca , G cab ) for randomly chosen numerical parameters on each of the networks N a , N b , N c , with color indicating the network topology.Since the points lie in a plane P through the origin, the axes have been rotated to view the plane orthogonally.The three planes G xyz = 0 intersect P in lines which divide the plot into six sectors.On three of these sectors exactly one color appears, indicating that the network topology is determined by the positivity of exactly one G xyz .On the 3 sectors where two colors appear, two of the G xyz are positive, so only one of the network topologies is ruled out.
Remark 4.7.It is natural to ask if f xyz or G xyz could be used to detect 3-cycles in situations where incomplete lineage sorting is negligible, so that all gene trees are displayed on the species network.This scenario is modeled by immediate coalescence of gene lineages on entering a common network edge or, equivalently, by a limiting model of the NMSC, in which all edge probabilities go to 0. (See [4, Section 6.2] for more details.)The formulae for CF s given in this work still apply, and for all the 5-taxon networks of Proposition 4.5 f abc = 0, while for all the 6-taxon networks of Proposition 4.6 G xyz = 0. Indeed, these functions depend only on CF s for quartets not displayed on the networks, which are therefore all zero.A coalescent process is thus essential to detecting 3-cycles with these functions.In fact, there is a neighborhood in V(N a ) = V(N b ) of the CF point of this example contained in the image of the parameterizations of both N a and N b .Indeed, a computation of the Jacobians for the two parameterization maps at the example parameters shows that locally the images are of dimension 6, which matches the dimension of the variety.A sufficiently small neighborhood of the CF point is thus in the image of the parameterizations for both N a and N b , with inverse images of positive measure.One may similarly show, using a CF point that arises only from N a (lying in a uniformly colored sector in Figure 4.5), that there is a set of positive measure in the N a parameter space which gives CF s in the image of the parametrization of N a only.We combine these results formally in the following theorem.Theorem 4.9.There exists a positive measure subset of the numerical parameter space of N a for which it is distinguishable from T 6 , N b , and N c , and a positive measure subset of the parameter space for which only the undirected network can be distinguished from T 6 , with 1 node in the 3-cycle determined to be non-hybrid.
Again using the parameter values in Example 4.8, an analog of this result for 5-taxon networks with a single 3-cycle can be established.Theorem 4.10.There exist positive measure subsets of the numerical parameter spaces of N 5−31 and N 5−32 for which the semidirected network topologies are distinguishable from the other networks among T 5 , N 5−31 , N 5−32 , N ′ 5−32 , and positive measure subset of the parameter spaces for which they are not distinguishable from at least one other of N 5−31 , N 5−32 , N ′ 5−32 .Proof.First, suppose the network is N 5−32 .Dropping a taxon to pass to a quartet network with a 3 2 -cycle, Proposition 4.3 implies that the semidirected topology is identifiable on some positive measure subset of parameters.That there is such a set on which the semidirected topology is not identifiable follows from using the parameter values of Example 4.8 (after dropping an appropriately chosen taxon) on such networks with different hybrid cherries, and computing Jacobians to verify that an open set of such examples exists.
To investigate identifiability for the network N 5−31 , consider the function We first show that f < 0 for all parameters on N 5−32 .Using Proposition B.4 to expand in terms of parameters, Since h 1 appears linearly in this expression with a negative coefficient, we set h 1 = 0 to bound f above.The coefficient of h 2 , which also appears linearly, may be positive or negative, so we consider h 2 = 0 and 1.If It is easy, however, to find an open set of parameters for N 5−31 for which f > 0, and on that set c is identifiable as the hybrid block.We obtain a set on which the semidirected topology of N 5−31 is not identifiable by again using the parameter values in Example 4.8.

4.2.4.
Large networks with 3-cycles.After considering specific 5-and 6taxon networks with a single 3-cycle, we shift focus to 3-cycles in general networks N + .We extend the previous results on semialgebraic identifiability of both cycles and hybrid nodes, using a decomposition of N + into 4 subnetworks, as in Figure 4.6.A similar decomposition is used in [10], of a level-1 network into trees and 'sunlets,' but that work does not model coalescence, so the details are quite different.Our decomposition extends to larger cycles but we present only the 3-cycle case needed here.
The subnetworks in Figure 4.6 are: D: The 3-cycle and its three adjacent cut edges, with pendant vertices a, b, c, where a is the child of the hybrid node of the cycle; A, B, C: The connected components containing a, b, c, respectively, when the edges and internal nodes of D are deleted from N + .Note that a, b, c are each in two of these subnetworks.Since the root must be above D's hybrid node, and the semidirected network is unchanged by moving the root along tree edges, we may assume the root lies in B or C, and, after renaming, in C.
The CF of any quartet under the NMSC on N + has an algebraic decomposition into terms associated to the subnetworks A, B, C, D, which we next develop.We use two facts about coalescent events between 4 lineages leading to gene quartets: 1.The first coalescent event between 2 of the lineages determines the gene quartet tree that forms, and 2. Conditioned on 3 or 4 lineages reaching a common node with no previous coalescence, by exchangeability of lineages each quartet has probability 1/3.For S ∈ {A, B, C, D} and a gene quartet xy|zw where x, y, z, w ∈ X are taxa on N + , we define an event, denoted C S → xy|zw, that captures whether the behavior of gene lineages in S ensures that under the coalescent model the gene tree xy|zw is formed, or will be, with a determined probability.This may be due to a coalescent event occurring in S, or 3 or 4 lineages reaching a common node in S without having yet coalesced.Since a coalescent event between the lineages occuring before they enter S would already determine the quartet tree, we define this event conditional on lineages from x, y, z, w entering S distinctly.More formally, consider the events: Let P (C S → xy|zw) denote the conditional probability of the event C S → xy|zw.Then with a i , b i , c i distinct taxa from A, B, C, respectively, a few example decompositions of CF s are: For calculating probabilities associated to D, we suppress indices on taxa.This is allowable since, conditioned on the lineages entering D distinctly, those from A are exchangeable, as are those from B. Thus, for instance, Significantly, all CF s for N + can be computed using only the following probabilities associated to D together with expressions dependent only on A, B, C: The 6 linear independent polynomials, p 1 , p 2 , . . ., p 6 parameterize a variety, V(D).Combined with the previous discussion of decomposing CF formulas, this yields the following.
Proposition 4.11.Let V N be the CF variety for a semidirected network (not necessarily level-1) N with the form shown in Figure 4.6, and numerical parameter space with the p i given above.Then the map CF : Θ(N ) → C 3( n 4 ) factors as where π is the map projecting Θ(N ) onto the numerical parameters on A, B, C only.
Proposition B.7 shows that V D = C 6 , and thus ϕ is an infinite-to-1 map, establishing the following.Corollary 4.12.Consider a semidirected topological network N with a 3-cycle, with decomposition as in Figure 4.6 (L).Then no test using polynomial equalities in quartet CF s can identify the hybrid node in the 3-cycle.
Specifically, if N's root must be in the subnetwork C because of the semidirected topology of C, then the network N B which has A, B interchanged from N = N A , so that B is below the 3-cycle's hybrid node, leads to the same ideal of invariants, that is, I(N A ) = I(N B ).If the semidirected topology of N allows for rooting in either subnetwork B or C, then Proof.If deleting the 3-cycle from the network induces a (n 1 , n 2 , n 3 ) partition of the taxa with all n i ≥ 2, then the corollary follows directly from Propositions 4.11 and B.7. Cases with n i = 1 then follow by deleting taxa from an appropriate network with all n i ≥ 2, intersecting the ideals with a ring generated by fewer CFs.
Note that this corollary applies to networks with more than one 3-cycle.However, when multiple cycles are present, the location of one cycle's hybrid node indicates that one of the nodes in a descendant cycle cannot be hybrid.Thus for a network with k 3-cycles, there are between 2 k and 3 k networks differing only in the choice of hybrid nodes in the 3-cycles, all of which are algebraically indistinguishable using CF s.
Nonetheless, using semialgebraic tests, we can obtain additional information on hybrid node location, as the following generalization of Proposition 4.6 shows.Proposition 4.13.Consider a partition of a taxon set X into three blocks of size at least 2. For any network N (not necessarily level-1) with a node or 3-cycle inducing these blocks, denote the node or 3-cycle and its adjacent edges by D, and the subgraphs attached to D as A, B, C (as in Figure 4.6 for a cycle).
Let G abc , G cab , G bca be as defined by equation (4.3), for any distinct taxa a i on A, b i on B, and c i on C. Then D is: for {x, y, z} = {a, b, c}. a 3-cycle and adjacent edges with if G xyz > 0, G yzx > 0, G zxy < 0 x or y below the hybrid node for {x, y, z} = {a, b, c}.
Moreover, for a network N with a 3-cycle D and descendants of x forming its hybrid block, there exist positive measure subsets of parameters on D on which G yzx and G zxy satisfy both of the above sign conditions.
Proof.First suppose D is a 3-cycle and, without loss of generality, A is below the hybrid node.Then we decompose formulas for CF s for N as Since But the terms in the last factor, all of which depend only on D, arise as multiples of CF s on the network N 6−32 of Since G abc + G cab + G bca = 0, either 1 or 2 of these terms are positive, and the two cases for 3-cycle Ds follow.The case of the network N with D a node is obtained by setting x = h 1 = h 2 = 0 in the formulas for any of the 3-cycle networks, showing, for instance, that G abc (N ) is a multiple of G abc (T 6 ) = 0.
The final statement on positive measure subsets of parameter space follows from Proposition 4.6.Proposition 4.13 yields the following generalization of Theorem 4.9.
Theorem 4.14.Consider a partition of a taxon set X into three blocks of size at least 2. Then for all networks (not necessarily level-1) with a node or 3-cycle inducing these blocks, the presence of the node or the (undirected) 3-cycle along with one nonhybrid block is identifiable.If the network has a 3-cycle then there are positive measure subsets of its parameter space on which the hybrid node can be determined, and on which it cannot.
Proof.By Proposition 4.13, an undirected 3-cycle is signaled by the non-vanishing of at least one of G abc , G bca or G cab , and for a 3-cycle, a non-hybrid block is identifiable since one of the Gs must be negative.That the hybrid node can be identified on a positive measure set follows from the existence of such a set for which only one G is positive.That the hybrid node cannot be identified on another set is seen by choosing specific parameters on 3-cycles with different hybrid nodes (e.g., using parameters given in Example 4.8 for the 3-cycle and adjacent edge parameters) which produce the same values for the p i .
If n i = 1 for some i, then similar arguments as given for Proposition 4.13 and Theorem 4.14 shows the function f xyz can identify the presence of a 3-cycle, but possibly not its hybrid node.While we omit the proof, we state the result.Proposition 4.15.Consider a partition of a taxon set X into three blocks of size 1, n 1 , n 2 with n i ≥ 2. For any network N (not necessarily level-1) with a node or 3-cycle inducing these blocks, let D denote the node or 3-cycle and adjacent edges, and A, B, C the subgraphs attached to D by the adjacent edges, with C being a single node.Let f abc be as in Proposition 4.5, for any distinct a i on A, b i on B, and c on C. Then for generic numerical parameters, D is: if, and only if, f abc = 0, a 3-cycle and adjacent edges with A or B below the hybrid node if f abc < 0, a 3-cycle and adjacent edges with A, B, or C below the hybrid node if f abc > 0.
Finally, there are positive measure subsets of the numerical parameter space for the networks with a 3-cycle D and either A or B below its hybrid node on which f abc < 0 and on which f abc > 0.
For identifying the hybrid node in a 3-cycle inducing a (1, n 1 , n 2 ) partition when there is a single descendant of the hybrid node, we generalize Theorem 4.10.
Theorem 4.16.Consider a partition of a taxon set X into three blocks of sizes 1, n 1 , n 2 with n i ≥ 2. Then for all networks (not necessarily level-1) with a 3-cycle We denote these by Ns, Nw, Nn from left to right, according to compass directions for the a 1 , a 2 cherry when the hybrid node is located at south.Note that Ne is omitted since, up to taxon labelling, it is the same as Nw.Edge probabilities and the hybridization parameter γ are shown next to edges.inducing these blocks, there are positive measure subsets of the parameter space on which the hybrid node of the 3-cycle is identifiable, and on which it is not.
Proof.Let f be as defined in (4.4).We first show the result for a general network N with a 3-cycle with a single hybrid descendant.For such a network, using decompositions as in Figure 4.6 but with C a hybrid singleton taxon and the root in B, we find that for any choices of two taxa in the A and B blocks where N 5−31 is given parameters from the 3-cycle and adjacent edges of N .Similarly, if a non-hybrid block C is the singleton where N 5−32 is given parameters from the 3-cycle and adjacent edges of N .Thus the signs of f on N can be used as in the proof of Theorem 4.10 to obtain the claim when the hybrid block is a singleton.
If the singleton block is not hybrid on N the claim is established as for Theorem 4.10, by passing to a subnetwork with a 3 2 -cycle and using the parameters of Example 4.8.

4-cycles.
To study topological 4-cycle identifiability beyond the results of [16] and [6], we consider first the networks N s , N w , N n on 5-taxa of Figure 4.7.Note that hybrid edge probabilities are not labeled for the networks N w and N n , since no coalescence can occur in those edges as they have only one descendant taxon.These three networks have equal subideals of linear invariants, as described in Appendix B, since the networks have the same undirected topology.Propositions B.8 to B.10, with additional computation, give an identifiability result for the hybrid nodes of 4-cycles in any level-1 network with at least 5 taxa.Proposition 4.17.Consider a semidirected binary level-1 network on n ≥ 5 taxa whose topology is known up to contracting 2-and 3-cycles and the direction of hybrid edges in 4-cycles.Then for generic numerical parameter values on the network, the 4-cycle hybrid edge directions are identifiable from CF s.
Proof.Suppose first a network N has exactly 5 taxa, and a 4-cycle.Then after contracting 2-cycles N yields N s , N w , N n , or one of the five networks shown in Figure 4.8.Although we do not know whether N has a 3-cycle, if it does then by Proposition 4.3 it has the same associated variety as the network with that 3-cycle contracted, so we investigate the relationships of V(N s ), V(N w ), and V(N n ).
Propositions B.8 to B.10 show that V(N s ), V(N w ), and V(N n ) have dimensions 5, 4, and 3, respectively.Moreover, V(N s ) contains both V(N w ) and has dimension 2, with three irreducible components whose ideals are Thus generic parameters for N s give points on neither V(N w ) nor V(N n ), while generic parameters for N w give points not on V(N n ), and generic parameters for N n give points not on V(N w ).Thus for generic parameters, the hybrid node in the 4cycle can be determined by testing invariants to see whether the CF s lie on V(N n ) or V(N w ), or neither.
If N has more than 5 taxa, choose one taxon from each of 3 of the taxon blocks determined by a 4-cycle, and 2 from the remaining block, and pass to the induced network on these 5 taxa to apply the result for 5-taxon networks.
Remark 4.18.The components V 1 , V 2 , and V 3 of V(N w ) ∩ V(N n ) arise naturally from the parameterizations.Restricting to γ = 1 on N w and γ = 0 on N n , essentially giving the unrooted tree ((a 1 , a 2 ), (b, c), d) for both, yields V 1 .V 2 arises from γ = 0 on N w and γ = 1 on N n which gives the unrooted tree ((a 1 , a 2 ), b, (c, d)).V 3 arises from ℓ = 0 on both N w and N n , which by corresponding to an infinite edge length, ensures a 1 , a 2 form a cherry in any gene tree involving those two taxa, and for those involving only one a i , gives CF s from a 4-cycle with unidentifiable hybrid node.4.4.Summary of topological identifiability.The results of this section combined with Theorem 3.3 yield the following theorem.Theorem 4.19 (Topological Identifiability from quartet CF s).Let N + be a binary level-1 phylogenetic network on n ≥ 5 taxa, with generic numerical parameters.Then no 2-cycle on the semidirected network can be identified, so let N be the topological semidirected network induced by N + with all 2-cycles replaced with edges.Then the topological structure of N , including directions of hybrid edges, is identifiable from quartet CF s of N + , with the following exceptions: 1.If a 3-cycle induces a (1, 1, n − 2) partition of taxa, then if the hybrid node has a single descendant taxon the network cannot be distinguished from the network in which the cycle is contracted to a node, or from the network in which the hybrid and other singleton block are interchanged.If the hybrid node has n − 2 descendant taxa, then there are positive-measure subsets of parameters on which the semidirected 3-cycle is and is not identifiable.2. If a 3-cycle induces a (1, n 1 , n 2 ) partition with n 1 , n 2 ≥ 2 then the undirected 3-cycle can be identified.There are positive measure subsets of parameters on which the semidirected 3-cycle is and is not identifiable.3.If a 3-cycle induces an (n 1 , n 2 , n 3 ) partition with all n i ≥ 2, then the undirected 3-cycle can be identified, and at least 1 of the 3-cycle nodes can be determined not to be hybrid, but there are positive measure subsets of parameters on which the semidirected 3-cycle is and is not identifiable.

Identifiability of numerical parameters.
To address identifiability of numerical parameters -both edge lengths and hybridization parameters -we assume the network has no 2-cycles, as these are not identifiable.For the remainder of the section we thus study N , the semidirected metric binary phylogenetic network induced from a rooted network N + , with 2-cycles replaced by edges.In showing an edge in N has identifiable length, we are showing that if the original network did have a 2-cycle, then an "effective" length of an edge resulting from replacing the cycle as in Lemma 4.1 is identifiable.
Since we assume exactly one sample per taxon for each gene, no coalescent event can occur in pendant edges.Thus no pendant edge length appears in CF parameterizations, and such lengths cannot be identified from CF s, yielding the following.Proposition 5.1.Let N be a semidirected phylogenetic network.Then pendant edge lengths are not identifiable from quartet CF s under the NMSC model with one sample per taxon.
5.1.Lengths of edges defined by 4 taxa.For some edges in N it is simple to identify the edge length.We first focus on one type of such edges.In an unrooted tree, every internal edge is defined by some Q, even if the tree is not binary.But for a network, even if binary and level-1, as in Figure 5.1, this is not the case: A k-cycle, with k ≥ 5, has k − 4 edges in it that are defined by such sets, with the hybrid edges and those adjacent to them exceptions, as will be proved in the next proposition.Edges descended from hybrid nodes are also never defined by a set Q.These examples show edges defined by a set Q need not be cut edges, and not all cut edges are defined by a set Q.
For a binary network, an alternate characterization of edges defined by sets Q can be given.
Proposition 5.3.For a binary level-1 semidirected network N , there is a set Q of 4 taxa defining an edge e if, and only if, e is an internal edge that is neither hybrid nor adjacent to a hybrid edge.
Proof.Suppose e is defined by Q.If e were either hybrid or adjacent to a hybrid edge, then Q would contain a descendant of a hybrid node.But then N (Q) contains all edges of the cycle in which the hybrid edge lies.This contradicts that both e and its adjacent edges are cut edges in N (Q), since the hybrid edges are not cut.
Conversely, suppose e is neither hybrid nor adjacent to a hybrid edge.If none of these 5 edges is in a cycle in N , then choosing one taxon in each component obtained by deleting e and its incident nodes and adjacent edges gives a set Q defining e.
If any one of these edges is in a cycle, then since N is level-1 and binary, exactly one of the following holds: a) e is in a cycle, together with exactly 2 adjacent edges, one at each endpoint of e, b) e is not in a cycle, but exactly one cycle contains two edges adjacent to e at the same endpoint of e, or c) e is not in a cycle, but all 4 edges adjacent to e are, with e adjacent to two different cycles.
For case (a), the 2 edges adjacent to e that are not in the cycle must be cut edges, and the two adjacent to e that are in the cycle must be adjacent to 2 other distinct cut edges not in the cycle.Choosing taxa from the non-e components left by deleting these 4 cut edges gives a set Q defining e.
In case (b), The two edges in the cycle must be adjacent to distinct cut edges other than e which are not in the cycle.Choosing taxa from the non-e components of the graph obtained by deleting these two edges and the two non-cycle edges adjacent to e gives a quartet defining e. Case (c) is similar, treating each cycle the same way.
For any network, regardless of level or other special structure, lengths of edges defined by sets Q are easily identified.Proposition 5.4.If an edge e in a metric network N is defined by a set Q of 4 taxa, then its length is identifiable from quartet CF s.
Proof.If e is defined by Q = {a, b, c, d} has length t and in N | Q induces the split ab|cd, then CF ac|bd = exp(−t)/3, so t = − log(3CF ac|bd ).

Numerical parameters associated to 3-cycles.
Edges either in or adjacent to a 3-cycle are always adjacent to a hybrid edge.Thus in binary networks, these edges are not defined by sets of 4 taxa, so Proposition 5.4 does not apply.Proposition B.3(c), Proposition B.6(c) and Proposition B.4(c) illustrate that, at least for specific small networks, the numerical parameters associated to 3-cycles are not identifiable.More generally, we obtain the following.Proposition 5.5.If C is a 3-cycle on a semidirected binary level-1 network N , then neither the hybridization parameters nor the lengths of any edges in or adjacent to C can be identified from quartet CF s.
Proof.Suppose first the 3-cycle induces an (n 1 , n 2 , n 3 )-partition of the taxa with all n i ≥ 2. Then using Proposition 4.11 and B.7(a) we see that the map from numerical parameters to CF s factors by sending the 7 numerical parameters associated to the 3-cycle and its adjacent edges into a 6-dimensional variety.This implies that the numerical parameters cannot all be identifiable.To see that no single parameter can be identified, first observe that from the factorization of maps in (4.5), if a single parameter were identifiable, it would have to be identifiable from a point in V D .However, Proposition B.7(b) shows that is not the case.
If a 3-cycle induces a (1, n 2 , n 3 )-or (1, 1, n 3 )-partition of taxa, then by considering samples of 2 individuals for each gene from the singleton taxa, we can modify the network by attaching cherries of pseudotaxa for each singleton.Since in this case we already know that numerical parameters around the 3-cycle are not identifiable from all CF s, with access only to CF s using only one of the pseudotaxa, they are still not identifiable.But that means they are not identifiable for the original network.

Other numerical parameters.
The remaining numerical parameters on a binary level-1 network to be considered include lengths of hybrid edges, lengths of edges adjacent to hybrid edges, and hybridization parameters, all when the relevant cycle is of size ≥ 4.
Proposition 5.6.Let N be a level-1 metric binary semidirected network with no 2-cycles, containing a k-cycle C with k ≥ 5. Then hybridization parameters and lengths of the cycle edges adjacent to the hybrid edges in C can be identified from quartet CF s.If the hybrid node of C has at least 2 descendant taxa, the lengths of the hybrid edges can also be identified.If the hybrid node has only one descendant taxon then the lengths of the hybrid edges are not identifiable.
Proof.From Proposition 5.4 we already know that the k − 4 edges in the cycle that are not hybrid or adjacent to a hybrid edge have identifiable lengths.If the taxon blocks for the cycle are, proceeding from the hybrid around the cycle, X 1 , X 2 , . . ., X k , then pick one taxon from each of X 1 , X 2 , X 3 , X 4 , and X k and pass to the induced subnetwork.Replacing any 2-cycles with edges, we may assume we have a 5-cycle sunlet network as in Figure 5.2(L), in which the edge probability y of the edge opposite the hybrid node is known, and the edge probability x is that of the edge in C which is adjacent to a hybrid edge, lying between blocks X 2 and X 3 .
Using y and CF s we can identify γ, and then x through Similarly, the other edge in C adjacent to a hybrid edge has identifiable length.
If the hybrid node has 2 descendant taxa, then by picking two taxa from X 1 and one from each of X 2 , X 3 , X k we pass to an induced subnetwork which, after replacing 2-cycles by edges, has the form of the network of Figure 5.2(C) or (R) with the same hybrid edge lengths as the full network.In case (C), a cherry below the hybrid node, applying the result of Proposition B.8 on N S identifies the hybrid edge lengths from CF s using the already identified γ.In case (R), a 3 1 -cycle below the hybrid node, by Proposition 4.3 all CF s are unchanged if the 3-cycle is contracted to a node and the edge length above it modified appropriately.Then the identifiability of the hybrid edge lengths follows from the cherry case.
If the hybrid node has only 1 descendant taxon, then at most 1 lineage may enter (going backwards in time) the hybrid edges of C, so no coalescent events may occur on the hybrid edges.Thus the CF s do not depend on the lengths of those edges, which are therefore not identifiable from CF s.
We next turn to cut edges adjacent to a single hybrid edge.
Proposition 5.7.Let N be a level-1 metric binary semidirected network with no 2-cycles, containing an internal cut edge e adjacent to exactly one hybrid edge (at its non-hybrid node), with the hybrid edge in a k-cycle.If k ≥ 4, then the length of e is identifiable.
Proof.If k ≥ 4, by passing to the induced network on a subset of the taxa, we may assume k = 4. Since e is not pendant, and not adjacent to a hybrid edge of another cycle, after again passing to an induced subnetwork and replacing any 2-cycles with single edges, we may assume the network has the structure of N w in Figure 4.7, with e the edge joining the cherry to the 4-cycle.But then Proposition B.9(c) gives the claim.
If an edge is adjacent to hybrid edges at both of its endpoints, but neither endpoint is a hybrid node, as in Figure 5.3 (L), then the following applies.Proposition 5.8.Let N be a level-1 metric binary semidirected network with no 2-cycles, containing an edge e adjacent to exactly two hybrid edges which lie in two different cycles.If the sizes of both cycles are ≥ 4, then the length of e is identifiable.
Proof.If both cycles are of size ≥ 4, then the network has an induced subnetwork which, after suppressing 2-cycles has the form shown in Figure 5.3(L), with the central edge arising from e, with edge probability ℓ.
Using Proposition B.9 on the induced network after dropping taxon f we may identify γ, x 1 , x 2 and the product ℓy 1 .Similarly, dropping a we may identify y 1 , which then gives ℓ.
Next we consider edges adjacent to two hybrid edges at one endpoint, that is, edges with a hybrid node as an endpoint, as in Figure 5. 3(C,R).If the hybrid node is in a large cycle we obtain the following.Proposition 5.9.Let N be a level-1 metric binary semidirected network with no 2-cycles, containing an edge q whose parent is the hybrid node of a k-cycle with k ≥ 5.If q has at least two descendant taxa, and the child node of q is not in a 3-cycle, then the length of q is identifiable.
Proof.Since the cycle is of size ≥ 5, by Proposition 5.6 its hybridization parameter γ is identified.
First suppose the child node of q is not incident to a hybrid edge.If q has two descendant taxa, there is an induced subnetwork which, after replacing 2-cycles by edges, has the form of N s of Figure 4.7(L), with q the child edge of the hybrid node.With γ in hand, by Proposition B.8(c) the length of q is identified.
If instead the child node of q is incident to a hybrid edge, assume that edge lies in a cycle of size ≥ 4. We may then pass to a network with the structure of Figure 5.3(R) where q is the edge joining the two cycles.But dropping taxon f again yields a network of form N s , so using γ we identify ℓy 1 .Instead dropping b from Figure 5.3(R), by Proposition 4.3, the 3-cycle on this can then be contracted to a node, adjusting the edge length of q (now possibly negative) so CF s are unchanged.Then Proposition B.9 can be applied to identify y 1 .Thus ℓ is identifiable.
The remaining parameters to consider are the edge probabilities and hybridization parameter in 4-cycles, and the edge probability of the child edge of the hybrid node in a 4-cycle.Identifiability of these is more complicated, as it can depend on the sizes of the taxon blocks of the cycle.In handling these cases, we use the following.
Lemma 5.10.Consider a 6-taxon semidirected network with a 4-cycle, a cherry below the cycle's hybrid node, and one other cherry, as shown in Figure 5.4.Then all numerical parameters are identifiable from quartet CF s.
i) the child of the edge with probability ℓ is not in a 3-cycle: all identifiable ii) the child of the edge with probability ℓ is in a 3-cycle: Simple instances of the 5 cases in the proposition may be helpful to consider.The network N s falls under case a), N n under b)i), N w under b)ii), and N sw and N sn under c)i).Examples for case c)ii) are obtained from N sw and N sn by replacing the cherry below the hybrid edge with a 3-cycle.The proof of the proposition leverages computational results for these to obtain more general statements.
Proof.That at least one of these cases must hold is most easily seen by noting that case c) is the complement of the union of a) and b).We consider each case to establish its claim.Case a): The 4-cycle determines a hybrid block of taxa A and three taxa, b, c, d, in singleton blocks.The only CF s dependent on the parameters θ = (x 1 , x 2 , h 1 , h 2 , γ, ℓ) are those involving at most two elements of A, since with 3 or 4 elements of A either a coalescence has occurred below the hybrid node, or at least 3 lineages reach it and are then exchangeable, giving probabilities 1/3 for each quartet tree.Those CF s dependent on θ decompose into sums of products of expressions involving only parameters outside of θ or only parameters in θ, similar to the approach in Subsec-tion 4.2.4.The expressions involving only parameters in θ can even be chosen from the CF s for the network N s of B.8.But that Proposition shows the parameters in θ are not identifiable from the CF s for N s , so they cannot be identified from those for N .

Case b)i):
The 4-cycle determines a hybrid singleton a, two adjacent singleton blocks of b and d, and a larger subnetwork C opposite the hybrid.Viewing the network as rooted in C, the CF s for N depend on parameters x 1 , x 2 , h 1 , h 2 , γ, ℓ only through the various probabilities of first coalescent events among subsets of {a, b, d} determining the quartet tree before lineages leave the 4-cycle and enter C .Using D to denote the subnetwork below C which contains the 4-cycle, these are Since these probabilities are linear functions of p 1 , p 2 , and none of γ, x 1 , x 2 are identifiable from p 1 , p 2 , none of the parameters are identifiable from CF s for N .

Case b)ii):
Pick two taxa in one of the blocks adjacent to the hybrid one, and one taxon in all others.Passing to the induced subnetwork and removing 2-cycles yields either a network with the form N w or one where the cherry in N w is replaced by a 3-cycle.Using Proposition 4.3, we may replace such a 3-cycle with a node without changing CF s, (provided we modify the edge length leading to the 4-cycle, including allowing for a possibly negative branch length.But then the network has the form N w and applying Proposition B.9 shows γ, x 1 , x 2 can be identified.
Since there is only one taxon descended from the hybrid node, there can be no coalescent event in either of the hybrid edges or their descendant, and thus these edge lengths do not appear in the formulas for the CF s for N .Therefore these parameters cannot be identifiable.

Case c)i):
Pick two taxa in one of the non-hybrid blocks, two taxa in the hybrid block, and one taxon from each of the others.Passing to the induced subnetwork on these 6 taxa, and removing any 2-cycles, we obtain a network of one of the forms in Figure 5.4, or ones where 3-cycles appear in place of one or both cherry nodes.If there are 3-cycles, by Proposition 4.3 we may replace them with nodes without changing CF s (provided we modify edge lengths leading to the 4-cycle).Then using Lemma 5.10 we can identify γ, x 1 , x 2 , h 1 , h 2 .
To identify ℓ, let v be the child node of the edge with this probability.If v is not in a cycle in N , then picking one taxon descended from each of its child edges and passing to an induced subnetwork, ℓ is identifiable by Lemma 5.10.
If v is in a cycle, it is of size ≥ 4. Passing to an induced subnetwork, we may assume that v is in a 4-cycle.Note that v cannot be the hybrid node of that cycle, else the semidirected network would not be rootable.If v is opposite the hybrid node, then we may pass to an induced subnetwork which, after replacing 2-cycles with edges, has a cherry below v and follow the previous argument.If v is adjacent to the hybrid node, then the subnetwork has the form of Figure 5.3(R).Since γ is identified, the argument used in Proposition 5.9 then shows ℓ is identifiable.

Case c)ii):
The argument of the first paragraph for Case c)i) shows γ, x 1 , x 2 , h 1 , h 2 are identifiable.Since the edge descending from the hybrid node of the 4-cycle is incident to a 3-cycle, its length is not identifiable by Proposition 5.5.5.4.Summary of numerical parameter identifiability.We summarize this section's results with the following.Theorem 5.12 (Numerical parameter identifiability from quartet CF s).Let N be a level-1 metric binary semidirected network with no 2-cycles.Then from quartet CF s under the NMSC with one sample per taxon all numerical parameters on N are identifiable except for the following, which are not identifiable: 1. pendant edge lengths, 2. for 3-cycles, hybridization parameters and the lengths of the six edges in and adjacent to the cycle, 3. for 4-cycles, the hybridization parameter and edge lengths in the cycle and descended from the hybrid node, as stated in Proposition 5.11.
If two individuals are sampled in some taxon x, as discussed earlier this can be modeled by attaching a cherry of pseudotaxa x 1 , x 2 at the leaf x, Doing so for all taxa resolves the non-identifiability issues of Items 1 and 3, yielding the following.
Corollary 5.13.Let N be a level-1 metric binary semidirected network with no 2-cycles.Then from quartet CF s under the NMSC with two or more samples for all taxa, all numerical parameters on N are identifiable except hybridization parameters and lengths of edges in and adjacent to 3-cycles.
6. Implications for data analysis.Attempting to infer the non-identifiable can either be misleading (unless all possible alternatives are reported), or very slow (spending computational time considering equally good possibilities), so our results here should inform development and use of CF -based inference methods.
The nature of 3-cycle identifiability from quartet CF s poses particular issues for likelihood and pseudolikelihood approaches.Quartet CFs carry signals of undirected 3-cycles, so ignoring the possibility of such cycles could have unknown consequences.But even if only the network topology is sought, these criteria require optimization over numerical parameters, so it seems necessary to include 3-cycle parameters in a search.Since for some parameter values there is a signal of a 3-cycle's hybrid node in the CF s, the search cannot be limited to undirected 3-cycles.However, these numerical parameters are not themselves identifiable, so considering them will be slow.Reducing the over-parameterization at 3-cycles (from 7 parameters to 3) would be desirable, but it is unclear how to do so while maintaining the same range of CF s.As the numerical parameters vary, the semidirected topology may pass between identifiable and non-identifiable regimes, and the boundaries of these are not known.
SNaQ, with its default settings, ignores the possibility of a 3-cycle and it is unclear how this might affect its optimization.Exactly what information in CF s is extracted by maximizing the pseudolikelihood function is difficult to analyze theoretically.Using simulation, the impact of 3-cycles on inference needs to be studied thoroughly, both for SNaQ and for PhyloNet's similar inference from rooted triples.
NANUQ does not suffer from these problems, as its inference goal is more modest, providing a statistically consistent estimate only of larger cycle topology, without any search over the numerical parameter space.Whether NANUQ can be supplemented to extract CF information on the existence of 3-cycles should also be explored.
Finally, identifiability theorems needed to justify network inference methods from data types other than CF s are largely lacking.Studies of the parameter identifiability question for these are also needed.

Appendix A. Context for this work.
A.1.Earlier work.Here we discuss how results in this work fit into and complement earlier work.In what follows, all statements should be restricted to the class of level-1 binary networks.
Questions of identifiability of both network topology and numerical parameters from quartet CF s were first considered in [16], a work which also introduced the pseudolikelihood inference of networks from empirical CF s via the SNaQ software.With the publication directed primarily to a biological audience, much of the presentation of the mathematical results appeared in the supplementary material, without formal statements of propositions and theorems, making it difficult to discern exactly what has been fully established.Subsequently, [6] reconsidered the identifiability of level-1 network topology from quartet CF s with more mathematical formality.A later preprint [17] gives more details on the work behind [16].
As does this paper, [16] uses algebraic computations to study what information quartet CF s might carry about a network.However, the computations concerning network topology are focused not on the full question of identifiability, but on a more restricted question the authors called detectability of hybridizations.This notion is an instance of the broader concept of distinguishability presented subsequently in [8].Distinguishability for a specified collection of possible models using some specified information source means that a model x that is known to be in the collection can be determined from that information for x.In [16] the models correspond to certain networks, and the information is expected CF s arising from them under the NMSC.As stated in [16], . . .for networks with n ≥ 4 taxa, we restrict our focus to the case when N ′ is the network topology obtained from N by removing a single hybrid edge of interest.. . .The presence of the hybridization of interest can be detected if the quartet CF s from N ′ cannot all be equal to the quartet CF s from N simultaneously.In other words, that work focused on distinguishability of the 2-element set {N, N ′ } using quartet CF s.While investigating this question was certainly a strong first contribution to the broader question of identifiability of level-1 network topology from CF s, addressing the full question would require showing distinguishability of the set of all possible level-1 networks on a taxon set X, including networks with completely unrelated topologies.More fully, one would need to show this for sets X of arbitrary size.The approach taken to prove detectability by [16], however, depends on equating formulas for CF s in terms of numerical parameters on the two networks and solving the system, and it is unclear how this could be applied to the full identifiability question.
Note that while [16] states its detectability results extend to level-1 networks with many cycles (where multiple hybrid edges may be removed to get the set of distinguishable networks), a justification for this claim was only given in [17], with Lemma 3 of that work being key.Unfortunately, that lemma is incorrect as stated.(See Appendix A.2 of this appendix for a counterexample.)While it might be possible to obtain a correct justification using similar ideas, doing so seems unnecessary given the results of [6], which we describe next.
After presenting a detailed study of all level-1 networks on 4 taxa and their CF s, [6] used combinatorial arguments to show that information about larger networks could be obtained through induced quartet networks, and hence CF s.In particular, the topological identifiability result stated in this paper as Theorem 3.3 was established.This work also clarified several points that were either implicit or unaddressed in [16]: First, the semidirected network was formally defined, highlighting that network structure above the LSA needed to be excised.Second, identifiability of large cycles (more than 4 edges) was explicitly addressed (although [17] later provided an argument for detectability).Third, identifiability for networks with multiple cycles was shown.However, because it considered only one CF at a time to deduce information about an induced quartet network, without exploring whether additional information might be found in the relationship of CF s for overlapping sets of taxa, it did not obtain the strongest possible result.As an example, while [16] showed the distinguishability of the hybrid node of a 4-cycle in certain sufficiently large networks, [6] left all 4-cycles undirected in its network identifiability result.
While [16] had shown in some cases 3-cycles were detectable, the main result of [6] omits 3-cycles in its identifiability result.It did contain a theorem, though, suggesting a 3 2 -cycle on a 4-taxon network may be identifiable for some parameter values.In both works, then, questions about 3-cycle identifiability were left open.
In short, the general question, in arbitrary level-1 networks, of topological identifiability of both directed and undirected 3-cycles and of directed 4-cycles, the focus of subsections 4.2 and 4.3 of this work, remained.
Although not considered by [6], the study of identifiability of numerical parameters was also initiated by [16].Using calculations of Gröbner bases of ideals of polynomial relationships between CF s, arguments in [16] (with a technical matter corrected in [17]) investigated whether the dimension of the associated algebraic variety allowed for all numerical parameters to be identifiable.If this dimension is less than the number of parameters, then not all parameters can be identified, though it is possible that a subset are.If the dimension and number of parameters are equal, then the parameterization map must be generically finite-to-1.For specific networks [16] showed whether or not the parameterization was finite-to-1, but passing to networks with more than 1 cycle again depended on the faulty Lemma 3 of [17].Moreover, the calculations in [16] are focused on numerical parameters associated to cycles (cycle edge lengths, hybridization parameters, and possibly lengths of edges adjacent to a cycle).Although the arguments do not seem to cover edges not adjacent to any cycles, this omission is easily overcome (for instance as in our Proposition 5.4).On the other hand, it is unclear how the computations can be applied to answer whether the edges adjacent to hybrid edges in two different cycles have identifiable lengths.Moreover, when a dimension computation leads to a valid conclusion that a full set of numerical parameters cannot be identified, it still is possible that some subset of the parameters could be.This issue was not explored.
Finally, establishing that there can only be finitely many choices of parameters that yield the CF s on a network is a local identifiability result, and [16] emphasized it leaves open the question of global identifiability: Does this finite set actually contain only one such choice?This cannot be settled by dimension computations of the sort in [16].There are in fact simple statistical models (outside of phylogenetics) where parameterizations are finite-to-1 but not 1-1 in surprising ways [3].Investigating global identifiability is thus highly desirable.
Several questions of identifiability of numerical parameters questions thus remained open: identifiability of certain edge lengths, especially in multicycle networks; identifiability of subsets of parameters when a full set was not identifiable; and the broad question of global identifiability.These questions are the focus of Section 5.
As the manuscript for this work was being completed, the preprint [19] appeared.This includes work on distinguishability of several different 6-taxon networks with 4-cycles and several cherries, but not on the full network identifiability questions even for 6-taxon networks with 4-cycles.It does however include simulation work to investigate practical identifiability, the extent to which with finite data sets one can infer such cycles, using several pseudolikelihood methods.This is an instance of the construction implied in the statement of Lemma 3 of [17], which is used to justify analyzing CF s only of level-1 networks with a single cycle to understand those with multiple cycles.We show here that the statement of Lemma 3 is not correct for this pair of networks.
We first note that this lemma states that the CF 's for the two networks which depend on parameters associated to the cycle are equal.This is not strictly true as stated: Focusing on the lower cycle of the left network, for instance, CF ab|ce depends on γ, x 1 as well as on δ for the left network, but for the right network has no dependence on δ.Other interpretations of this statement are that the described set of CF s for the two networks produce the same values as parameters vary, or that the varieties defined by the parameterized CF s are the same.As this last interpretation is the broadest, we show it is also false.
Consider CF ac|df and CF ac|ef for the two networks.For the left network which are not equal for generic parameter values.However, for the right network since d, e can be exchanged in the graph because they form a cherry.Thus the variety for the right network has a defining equation that does not hold for the left.This example illustrates an important point that in passing to smaller induced networks a cycle other than the one of immediate focus may have an impact on CF s.In particular, a more detailed treatment of networks with multiple cycles, such as our work in Subsection 4.3, seems to be a necessary part of addressing the full identifiability question.

Appendix B. Propositions from Computations.
B.1.Ideals of CF invariants.Several observations simplify investigating the ideals I(N ) described in Section 3. First, since the entries of any vector concordance factor add to 1, at most two of the 3 entries of a CF are linearly independent.We consider the polynomials expressing that CF entries add to 1 to be trivial invariants, as they hold for all network topologies.If only 4 taxa a, b, c, d are considered, then the trivial invariant is CF ab|cd + CF ac|bd + CF ad|bc − 1, while for n taxa there are n 4 such invariants, one for each possible quartet.Second, if there is a cut edge on an induced quartet network separating two of the taxa, a, b from the others c, d, then the CF s associated with the discordant topologies ac|bd and ad|bc must be equal.This was shown in the level-1 case in [6] and for general networks in [1].The invariant expressing this is We call such polynomials cut invariants.
Third, when a network has one or more cherries (2 taxa joined by pendant edges to a common node), we will label the taxa in each cherry by the same subscripted letter, such as a 1 , a 2 .(See for instance Figures 4.3 and 4.4.)In CF s involving such taxa, we may then suppress the subscripts, since under the NMSC model on a suitably rooted version of the semidirected network the taxa a 1 and a 2 are exchangeable, giving the same CF values when they are interchanged.Thus, for example, on a network with a cherry formed by a 1 , a 2 , CF ab|cd = CF a1b|cd = CF a2b|cd , and CF ab|ac = CF a1b|a2c = CF a2b|a1c .
We call these exchange invariants, but note that some of these these are also cut invariants.For computations, our notational simplification of surpressing subscripts allows for their omission.
Definition B.1.For a topological phylogenetic network N, the ideal generated by all trivial, cut, and exchange invariants of N is denoted J (N ).
Note that J (N ) depends on the topology of the network N due to the cut and exchange invariants.However, it only depends on the undirected topology.Also, while J (N ) ⊆ I(N ) these ideals are typically not equal.Since the generators of J (N ) are linear, and simple to enumerate, we use them for removing some CF s from calculations of I(N ), and for stating results on ideal generators more succinctly.
The variety V N5−3 1 has dimension 3. (c) The numerical parameters γ, x, ℓ 1 and ℓ 2 cannot be determined from the quartet CF s.They can be determined with one degree of freedom, for instance if ℓ 1 is known: ).The variety V N5−3 2 has dimension 3. (c) The 6 numerical parameters γ, h 1 , h 2 , x, ℓ 1 and ℓ 2 cannot be determined from the quartet CF s.They can be determined with three degrees of freedom.If γ, ℓ 1 and ℓ 2 are known then: Proposition B.5.Let T 6 be the 6-taxon tree with a central node and 3 cherries, as in Figure 4.4 (R).Then, (a) The quartet concordance factors of T 6 are The variety V T6 has dimension 3. (c) The numerical parameters ℓ 1 , ℓ 2 , ℓ 3 can be determined from the quartet CF s by: ).The variety V N6−3 2 has dimension 6. (c) The 7 numerical parameters γ, x, ℓ 1 , ℓ 2 , ℓ 3 , h 1 , h 2 cannot be determined from the quartet CF s.They can be determined with one degree of freedom, for instance if ℓ 3 is given: (c) The parameter ℓ can be identified from the quartet CF s: ℓ = 3CF a 1 ba 2 d , but x 1 , x 2 , and γ cannot be.However, the quantities x 1 γ −γ and x 2 (γ −1)−γ can be determined from the quartet CF s, so if γ is known The variety V Nsn has dimension 7. (c) The numerical parameters γ, h 1 , h 2 , x 1 , x 2 , ℓ 1 and ℓ 2 can be determined from the quartet CF s.

Fig. 4 . 3 .
Fig. 4.3.(L) The 5-taxon unrooted binary tree T 5 ; (C) the 5-taxon network N 5−3 1 with a 3 1 -cycle; and (R) the 5-taxon network N 5−3 2 with a 3 2 -cycle, with numerical parameters shown.Edge probabilities of hybrid edges in N 5−3 1 and of pendant edges in networks are omitted, since they do not appear in formulas for the CF s.

Fig. 4 . 4 .
Fig. 4.4.(L) The 6-taxon tree T 6 with three cherries.(R) The 6-taxon network Na with a central 3-cycle surrounded by 3 cherries, with a 1 , a 2 descending from the hybrid node.The network N b is obtained by 'rotating' the three pairs of taxa so b 1 , b 2 descend from the hybrid node, and similarly for Nc.

Theorem 4 . 4 .
Under the NMSC model, the vanishing of f abc distinguishes a 5taxon unrooted tree T 5 from the 5-taxon semidirected networks with a central 3-cycle whose contraction yields the tree T 5 , for generic numerical parameters.Proof.Consider the networks ofFigure 4.3 and a fourth obtained by interchanging the a, b taxa in Figure 4.3(R).Since f abc /

Propositions B. 2
to B.4 also show that the two 5-taxon networks of Figure 4.3 have the same associated ideals, I(N 5−31 ) = I(N 5−32 ) ⊂ I(T 5 ).As a result, there is no purely algebraic means (using only polynomial equalities) of distinguishing them using CF s.Computational results for the 6-taxon networks T 6 and N a of Figure 4.4 appear in Propositions B.5 and B.6.Note that I(T 6 ) contains 3 polynomials, f abc , f bca , f cab , none of which are in I(N a ), expressing three different internal path length relationships in the tree.Proposition B.6 implies I(N a ) = I(N b ) = I(N c ), where N b and N c

Proposition 4 . 5 .
Let N be one of T 5 , N 5−31 , N 5−32 of Figure 4.3, or the network N ′ 5−32 obtained from interchanging the a, b taxon labels on N 5−32 .Let f abc be as in (4.2).Then for generic numerical parameters under the NMSC,

Fig. 4 . 5 .
Fig. 4.5.Values of (G abc , G bca , G cab ) plotted in three dimensions, for random numerical parameter values on each of the three networks Na, N b , Nc. Color indicates network topology.Plotted points lie in the plane x + y + z = 0, which is viewed orthogonally.The three coordinate planes x = 0, y = 0, z = 0 intersect this plane in the colored lines, separating the points by color into overlapping half-planes.Numerical parameters for networks were chosen uniformly from the interval [0, 1].

Proposition 4 . 6 and
Figure 4.5 suggest determining which of N a , N b , or N c produced certain numerical CF s may be impossible, which we rigorously show by the following example.

Example 4 . 8 (
Non-identifiability of the hybrid node in a 3-cycle).Consider the network N a with parameters γ (a) the network N b with parameters for N b are as shown for N a in Figure 4.4 but with taxon labels (a 1 , a 2 ), (b 1 , b 2 ) and (c 1 , c 2 ) replaced by (b 1 , b 2 ), (c 1 , c 2 ) and (a 1 , a 2 ) respectively.Then the CF s of N a and N b are equal.Specifically, for both N a and N b ,

Fig. 4 . 6 .
Fig. 4.6.(L) A decomposition of a level-1 network N + with a 3-cycle into 4 subnetworks, denoted A, B, C, D, with root in C. (R) The semidirected 3-cycle network N 6−3 2 with 3 cherries, which is a simple instance of the network on the left.
xy|zw) = No coalescence between any of the lineages x, y, z, w that may enter S occurs before they enter S. F = F (S, xy|zw) = 3 or 4 of the x, y, z, w lineages reach a common node in S before any coalescence, and afterwards xy|zw forms.G = G(S, xy|zw) = A first coalescence occurs in S between x, y or between z, w, without 3 or 4 lineages having reached a common node previously Then C S → xy|zw denotes (F ∪ G)|E.

2 x2Fig. 4 . 7 .
Fig. 4.7.The semidirected 5-taxon binary networks with a single 4-cycle, up to taxon labelling.We denote these by Ns, Nw, Nn from left to right, according to compass directions for the a 1 , a 2 cherry when the hybrid node is located at south.Note that Ne is omitted since, up to taxon labelling, it is the same as Nw.Edge probabilities and the hybridization parameter γ are shown next to edges.

Definition 5 . 2 .
Let e be an edge in N .Then we say e is defined by a set Q = {a, b, c, d} of 4 taxa if:1.Edge e lies in the subnetwork N (Q) of N composed of all edges and nodes which form the induced N | Q once degree 2 nodes are suppressed, 2. Edge e is a cut edge of N (Q) separating pairs of taxa, say a, b from c, d, and 3.In N (Q) there are 4 cut edges adjacent to e, separating each of a, b, c, d, respectively, from the others.

A. 2 .
An example.Consider the two networks shown in Figure A.1, where the right network is obtained from the left by removing a hybrid edge in the top 4-cycle.

Fig. A. 1 .
Fig. A.1.(L) A network with a 4-cycle of interest at bottom, and (R) a network with a single cycle obtained by removing a hybrid edge from each pair not in the cycle of interest.